{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Use Ragas to evaluate a RAG pipeline\n",
    "\n",
    "Ragas is an open source project for evaluating RAG components. [Paper](https://arxiv.org/abs/2309.15217), [Code](https://github.com/explodinggradients/ragas), [Docs](https://docs.ragas.io/en/stable/getstarted/index.html), [Intro blog](https://medium.com/towards-data-science/rag-evaluation-using-ragas-4645a4c6c477).\n",
    "\n",
    "<div>\n",
    "<img src=\"../../images/ragas_eval_image.png\" width=\"80%\"/>\n",
    "</div>\n",
    "\n",
    "**Please note that Ragas can consume a large number of OpenAI API tokens.** <br>\n",
    "\n",
    "Read through this notebook carefully and pay attention to the number of questions and metrics you want to evaluate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### 1. Prepare Ragas environment and ground truth data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# ! python -m pip install langchain openai datasets ragas pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Question</th>\n",
       "      <th>ground_truth_answer</th>\n",
       "      <th>Sources</th>\n",
       "      <th>Custom_RAG_context</th>\n",
       "      <th>Custom_RAG_answer</th>\n",
       "      <th>llama3_answer</th>\n",
       "      <th>anthropic_claud3_haiku_answer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>What do the parameters for HNSW mean?</td>\n",
       "      <td># M: maximum degree of nodes in a layer of the...</td>\n",
       "      <td>https://milvus.io/docs/index.md</td>\n",
       "      <td>this value can improve recall rate at the cost...</td>\n",
       "      <td>The parameters for HNSW are as follows:\\n\\n# M...</td>\n",
       "      <td>The parameters for HNSW include M, which is th...</td>\n",
       "      <td>The parameters for HNSW (Hierarchical Navigabl...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What are good default values for HNSW paramete...</td>\n",
       "      <td>M=16, efConstruction=32, ef=32</td>\n",
       "      <td>https://milvus.io/docs/index.md, https://milvu...</td>\n",
       "      <td>parameters vary with Milvus distribution. Sele...</td>\n",
       "      <td>M=16, efConstruction=500, and ef=64</td>\n",
       "      <td>For a Milvus distribution, there is no direct ...</td>\n",
       "      <td>I don't know. The context provided does not co...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>What does nlist mean in ivf_flat?</td>\n",
       "      <td>The `nlist` parameter in IVF_FLAT index divide...</td>\n",
       "      <td>https://milvus.io/docs/index.md</td>\n",
       "      <td>index? IVF_FLAT index divides a vector space i...</td>\n",
       "      <td>In IVF_FLAT, nlist refers to the number of clu...</td>\n",
       "      <td>The `nlist` parameter in IVF_FLAT index divide...</td>\n",
       "      <td>The nlist parameter in the IVF_FLAT index in M...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>What is the default AUTOINDEX distance metric ...</td>\n",
       "      <td>Trick answer:  IP inner product, not yet updat...</td>\n",
       "      <td>https://milvus.io/docs/index.md</td>\n",
       "      <td>and Hamming. This type of indexes include BIN_...</td>\n",
       "      <td>L2</td>\n",
       "      <td>According to the Milvus documentation, the def...</td>\n",
       "      <td>The default AUTOINDEX distance metric in Milvu...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            Question  \\\n",
       "0              What do the parameters for HNSW mean?   \n",
       "1  What are good default values for HNSW paramete...   \n",
       "2                  What does nlist mean in ivf_flat?   \n",
       "3  What is the default AUTOINDEX distance metric ...   \n",
       "\n",
       "                                 ground_truth_answer  \\\n",
       "0  # M: maximum degree of nodes in a layer of the...   \n",
       "1                     M=16, efConstruction=32, ef=32   \n",
       "2  The `nlist` parameter in IVF_FLAT index divide...   \n",
       "3  Trick answer:  IP inner product, not yet updat...   \n",
       "\n",
       "                                             Sources  \\\n",
       "0                    https://milvus.io/docs/index.md   \n",
       "1  https://milvus.io/docs/index.md, https://milvu...   \n",
       "2                    https://milvus.io/docs/index.md   \n",
       "3                    https://milvus.io/docs/index.md   \n",
       "\n",
       "                                  Custom_RAG_context  \\\n",
       "0  this value can improve recall rate at the cost...   \n",
       "1  parameters vary with Milvus distribution. Sele...   \n",
       "2  index? IVF_FLAT index divides a vector space i...   \n",
       "3  and Hamming. This type of indexes include BIN_...   \n",
       "\n",
       "                                   Custom_RAG_answer  \\\n",
       "0  The parameters for HNSW are as follows:\\n\\n# M...   \n",
       "1                M=16, efConstruction=500, and ef=64   \n",
       "2  In IVF_FLAT, nlist refers to the number of clu...   \n",
       "3                                                 L2   \n",
       "\n",
       "                                       llama3_answer  \\\n",
       "0  The parameters for HNSW include M, which is th...   \n",
       "1  For a Milvus distribution, there is no direct ...   \n",
       "2  The `nlist` parameter in IVF_FLAT index divide...   \n",
       "3  According to the Milvus documentation, the def...   \n",
       "\n",
       "                       anthropic_claud3_haiku_answer  \n",
       "0  The parameters for HNSW (Hierarchical Navigabl...  \n",
       "1  I don't know. The context provided does not co...  \n",
       "2  The nlist parameter in the IVF_FLAT index in M...  \n",
       "3  The default AUTOINDEX distance metric in Milvu...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Read questions and ground truth answers into a pandas dataframe.\n",
    "# Note: Surround each context string with ''' to avoid issues with quotes inside.\n",
    "# Note: Separate each context string with a comma.\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "\n",
    "# Get the current working directory.\n",
    "cwd = os.getcwd()\n",
    "relative_path = '/data/ground_truth_answers.csv'\n",
    "file_path = cwd + relative_path\n",
    "\n",
    "# Read ground truth answers from file.\n",
    "eval_df = pd.read_csv(file_path, header=0, skip_blank_lines=True)\n",
    "display(eval_df.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Question</th>\n",
       "      <th>ground_truth_answer</th>\n",
       "      <th>Sources</th>\n",
       "      <th>Custom_RAG_context</th>\n",
       "      <th>Custom_RAG_answer</th>\n",
       "      <th>llama3_answer</th>\n",
       "      <th>anthropic_claud3_haiku_answer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>What do the parameters for HNSW mean?</td>\n",
       "      <td># M: maximum degree of nodes in a layer of the...</td>\n",
       "      <td>https://milvus.io/docs/index.md</td>\n",
       "      <td>this value can improve recall rate at the cost...</td>\n",
       "      <td>The parameters for HNSW include M, which is th...</td>\n",
       "      <td>The parameters for HNSW include M, which is th...</td>\n",
       "      <td>The parameters for HNSW (Hierarchical Navigabl...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What are good default values for HNSW paramete...</td>\n",
       "      <td>M=16, efConstruction=32, ef=32</td>\n",
       "      <td>https://milvus.io/docs/index.md, https://milvu...</td>\n",
       "      <td>parameters vary with Milvus distribution. Sele...</td>\n",
       "      <td>For a Milvus distribution, there is no direct ...</td>\n",
       "      <td>For a Milvus distribution, there is no direct ...</td>\n",
       "      <td>I don't know. The context provided does not co...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>What does nlist mean in ivf_flat?</td>\n",
       "      <td>The `nlist` parameter in IVF_FLAT index divide...</td>\n",
       "      <td>https://milvus.io/docs/index.md</td>\n",
       "      <td>index? IVF_FLAT index divides a vector space i...</td>\n",
       "      <td>The `nlist` parameter in IVF_FLAT index divide...</td>\n",
       "      <td>The `nlist` parameter in IVF_FLAT index divide...</td>\n",
       "      <td>The nlist parameter in the IVF_FLAT index in M...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>What is the default AUTOINDEX distance metric ...</td>\n",
       "      <td>Trick answer:  IP inner product, not yet updat...</td>\n",
       "      <td>https://milvus.io/docs/index.md</td>\n",
       "      <td>and Hamming. This type of indexes include BIN_...</td>\n",
       "      <td>According to the Milvus documentation, the def...</td>\n",
       "      <td>According to the Milvus documentation, the def...</td>\n",
       "      <td>The default AUTOINDEX distance metric in Milvu...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            Question  \\\n",
       "0              What do the parameters for HNSW mean?   \n",
       "1  What are good default values for HNSW paramete...   \n",
       "2                  What does nlist mean in ivf_flat?   \n",
       "3  What is the default AUTOINDEX distance metric ...   \n",
       "\n",
       "                                 ground_truth_answer  \\\n",
       "0  # M: maximum degree of nodes in a layer of the...   \n",
       "1                     M=16, efConstruction=32, ef=32   \n",
       "2  The `nlist` parameter in IVF_FLAT index divide...   \n",
       "3  Trick answer:  IP inner product, not yet updat...   \n",
       "\n",
       "                                             Sources  \\\n",
       "0                    https://milvus.io/docs/index.md   \n",
       "1  https://milvus.io/docs/index.md, https://milvu...   \n",
       "2                    https://milvus.io/docs/index.md   \n",
       "3                    https://milvus.io/docs/index.md   \n",
       "\n",
       "                                  Custom_RAG_context  \\\n",
       "0  this value can improve recall rate at the cost...   \n",
       "1  parameters vary with Milvus distribution. Sele...   \n",
       "2  index? IVF_FLAT index divides a vector space i...   \n",
       "3  and Hamming. This type of indexes include BIN_...   \n",
       "\n",
       "                                   Custom_RAG_answer  \\\n",
       "0  The parameters for HNSW include M, which is th...   \n",
       "1  For a Milvus distribution, there is no direct ...   \n",
       "2  The `nlist` parameter in IVF_FLAT index divide...   \n",
       "3  According to the Milvus documentation, the def...   \n",
       "\n",
       "                                       llama3_answer  \\\n",
       "0  The parameters for HNSW include M, which is th...   \n",
       "1  For a Milvus distribution, there is no direct ...   \n",
       "2  The `nlist` parameter in IVF_FLAT index divide...   \n",
       "3  According to the Milvus documentation, the def...   \n",
       "\n",
       "                       anthropic_claud3_haiku_answer  \n",
       "0  The parameters for HNSW (Hierarchical Navigabl...  \n",
       "1  I don't know. The context provided does not co...  \n",
       "2  The nlist parameter in the IVF_FLAT index in M...  \n",
       "3  The default AUTOINDEX distance metric in Milvu...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Replace the Custom_RAG_answer column with the LLM answer of your choice.\n",
    "\n",
    "# Possible choices:\n",
    "# 1. openai gpt-3.5-turbo = 'Custom_RAG_answer'\n",
    "# 2. llama3_answer\n",
    "# 3. anthropic_claud3_haiku_answer\n",
    "# LLM_TO_EVALUATE = 'Custom_RAG_answer'\n",
    "LLM_TO_EVALUATE = 'llama3_answer'\n",
    "# LLM_TO_EVALUATE = 'anthropic_claud3_haiku_answer'\n",
    "\n",
    "temp_df = eval_df.copy()\n",
    "if LLM_TO_EVALUATE != 'Custom_RAG_answer':\n",
    "    temp_df['Custom_RAG_answer'] = temp_df[LLM_TO_EVALUATE]\n",
    "\n",
    "# Display the dataframe.\n",
    "display(temp_df.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ragas uses HuggingFace Datasets by default.\n",
    "# https://docs.ragas.io/en/latest/getstarted/evaluation.html\n",
    "from datasets import Dataset\n",
    "\n",
    "def assemble_ragas_dataset(input_df):\n",
    "    \"\"\"Assemble a RAGAS HuggingFace Dataset from an input pandas df.\"\"\"\n",
    "\n",
    "    # Assemble Ragas lists: questions, ground_truth_answers, retrieval_contexts, and RAG answers.\n",
    "    question_list, truth_list, context_list = [], [], []\n",
    "\n",
    "    # Get all the questions.\n",
    "    question_list = input_df.Question.to_list()\n",
    "\n",
    "    # Get all the ground truth answers.\n",
    "    truth_list = input_df.ground_truth_answer.to_list()\n",
    "\n",
    "    # Get all the Milvus Retrieval Contexts as list[list[str]]\n",
    "    context_list = input_df.Custom_RAG_context.to_list()\n",
    "    context_list = [[context] for context in context_list]\n",
    "\n",
    "    # Get all the RAG answers based on contexts.\n",
    "    rag_answer_list = input_df.Custom_RAG_answer.to_list()\n",
    "\n",
    "    # Create a HuggingFace Dataset from the ground truth lists.\n",
    "    ragas_ds = Dataset.from_dict({\"question\": question_list,\n",
    "                            \"contexts\": context_list,\n",
    "                            \"answer\": rag_answer_list,\n",
    "                            \"ground_truth\": truth_list\n",
    "                            })\n",
    "    return ragas_ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a Ragas HuggingFace Dataset from temp_df, so that the answers\n",
    "# selected by LLM_TO_EVALUATE are the ones being evaluated.\n",
    "ragas_input_ds = assemble_ragas_dataset(temp_df)\n",
    "display(ragas_input_ds)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Debugging: inspect all the data.\n",
    "ragas_input_df = ragas_input_ds.to_pandas()\n",
    "display(ragas_input_df.head())"
   ]
  },
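  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before spending tokens, it can help to confirm that the assembled columns have the layout built above: `question`, `contexts`, `answer`, and `ground_truth`, all of equal length, with `contexts` holding a list of strings per row. The cell below is a minimal sketch, not part of Ragas: `check_ragas_columns` is a hypothetical helper and the example row is made up for illustration. To check the real data, something like `check_ragas_columns(ragas_input_ds.to_dict())` should work, assuming the dataset assembled above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper (not a Ragas API): sanity-check the column layout.\n",
    "REQUIRED_COLUMNS = ('question', 'contexts', 'answer', 'ground_truth')\n",
    "\n",
    "def check_ragas_columns(columns):\n",
    "    \"\"\"Return a list of problems found in a dict of column name -> list of values.\"\"\"\n",
    "    problems = [f'missing column: {name}'\n",
    "                for name in REQUIRED_COLUMNS if name not in columns]\n",
    "    if problems:\n",
    "        return problems\n",
    "\n",
    "    # Every column must have one entry per question.\n",
    "    lengths = {name: len(columns[name]) for name in REQUIRED_COLUMNS}\n",
    "    if len(set(lengths.values())) != 1:\n",
    "        problems.append(f'column lengths differ: {lengths}')\n",
    "\n",
    "    # Each row of contexts must be a list of strings, not a bare string.\n",
    "    for i, ctx in enumerate(columns['contexts']):\n",
    "        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):\n",
    "            problems.append(f'row {i}: contexts must be a list of strings')\n",
    "    return problems\n",
    "\n",
    "# Made-up example row, for illustration only.\n",
    "example = {\n",
    "    'question': ['What does nlist mean in ivf_flat?'],\n",
    "    'contexts': [['IVF_FLAT divides a vector space into nlist clusters.']],\n",
    "    'answer': ['nlist is the number of clusters.'],\n",
    "    'ground_truth': ['nlist sets the number of cluster units.'],\n",
    "}\n",
    "print(check_ragas_columns(example))"
   ]
  },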
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### 2. Start Ragas Evaluation with custom Evaluation LLM\n",
    "\n",
    "The default OpenAI model used by Ragas is `gpt-3.5-turbo-16k`.\n",
    "\n",
    "Note that evaluation can consume a large number of OpenAI API tokens. Every question you evaluate, and every metric you compute for it, sends requests to the OpenAI service, so keep an eye on your token usage."
   ]
  },
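  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because cost grows with both the number of questions and the number of metrics, one way to keep a first run cheap is to cap the number of questions. The cell below is a rough sketch: `limit_questions`, `MAX_QUESTIONS`, and the demo dataframe are hypothetical names used only for illustration, and the request count is a lower bound that assumes at least one request per question per metric. The real number can be higher."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "MAX_QUESTIONS = 2  # hypothetical cap on the number of eval questions\n",
    "N_METRICS = 2      # e.g. context_precision and context_recall\n",
    "\n",
    "def limit_questions(df, max_questions, seed=42):\n",
    "    \"\"\"Return at most max_questions rows, sampled reproducibly.\"\"\"\n",
    "    if len(df) <= max_questions:\n",
    "        return df.copy()\n",
    "    return df.sample(n=max_questions, random_state=seed).reset_index(drop=True)\n",
    "\n",
    "# Demo dataframe standing in for the eval dataframe.\n",
    "demo_df = pd.DataFrame({'Question': [f'question {i}' for i in range(4)]})\n",
    "small_df = limit_questions(demo_df, MAX_QUESTIONS)\n",
    "\n",
    "# Assumed lower bound: one request per question per metric.\n",
    "lower_bound_requests = len(small_df) * N_METRICS\n",
    "print(f'{len(small_df)} questions x {N_METRICS} metrics '\n",
    "      f'= at least {lower_bound_requests} requests')"
   ]
  },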
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, openai, pprint\n",
    "from openai import OpenAI\n",
    "\n",
    "# Save api key in env variable.\n",
    "# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety\n",
    "openai_api_key=os.environ['OPENAI_API_KEY']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the metrics you want to see.\n",
    "# Remove context relevancy metric - it is deprecated and not maintained.\n",
    "from ragas.metrics import (\n",
    "    context_recall, \n",
    "    context_precision, \n",
    "    # answer_relevancy,\n",
    "    # answer_similarity,\n",
    "    # answer_correctness,\n",
    "    # faithfulness,\n",
    "    )\n",
    "metrics = ['context_recall', 'context_precision']\n",
    "\n",
    "# Change the default LLM-as-critic.\n",
    "# It is also possible to switch out a HuggingFace open LLM here if you want.\n",
    "# https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html\n",
    "from ragas.llms import llm_factory\n",
    "LLM_NAME = \"gpt-3.5-turbo\"\n",
    "# Default temperature = 1e-8\n",
    "ragas_llm = llm_factory(model=LLM_NAME)\n",
    "\n",
    "# Change the default embeddings to HuggingFace models.\n",
    "from langchain_community.embeddings import HuggingFaceEmbeddings\n",
    "from ragas.embeddings import LangchainEmbeddingsWrapper\n",
    "EMB_NAME = \"BAAI/bge-large-en-v1.5\"\n",
    "lc_embeddings = HuggingFaceEmbeddings(model_name=EMB_NAME)\n",
    "\n",
    "# # Alternatively use OpenAI embedding models.\n",
    "# # https://openai.com/blog/new-embedding-models-and-api-updates\n",
    "# from langchain_openai.embeddings import OpenAIEmbeddings\n",
    "# lc_embeddings = OpenAIEmbeddings(\n",
    "#     model=\"text-embedding-3-small\", \n",
    "#     # 512 or 1536 possible for 3-small\n",
    "#     # 256, 1024, or 3072 for 3-large\n",
    "#     dimensions=512)\n",
    "ragas_emb = LangchainEmbeddingsWrapper(embeddings=lc_embeddings)\n",
    "\n",
    "# Change each metric.\n",
    "for metric in metrics:\n",
    "    globals()[metric].llm = ragas_llm\n",
    "    globals()[metric].embeddings = ragas_emb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Evaluate the dataset.\n",
    "from ragas import evaluate\n",
    "\n",
    "ragas_result = evaluate(\n",
    "    ragas_input_ds,\n",
    "    metrics=[\n",
    "        context_precision,\n",
    "        context_recall,\n",
    "    ],\n",
    "    llm=ragas_llm,\n",
    ")\n",
    "\n",
    "# View evaluations.\n",
    "ragas_output_df = ragas_result.to_pandas()\n",
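    "temp = ragas_output_df.fillna(0.0)\n",
    "# F1 is 0/0 (NaN) when precision and recall are both 0; score those rows as 0.\n",
    "temp['context_f1'] = (2.0 * temp.context_precision * temp.context_recall\n",
    "                      / (temp.context_precision + temp.context_recall)).fillna(0.0)\n",
    "display(temp.head())\n",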
    "\n",
    "# Calculate Retrieval average score.\n",
    "avg_retrieval_f1 = np.round(temp.context_f1.mean(),2)"
   ]
  },
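  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The retrieval F1 score is the harmonic mean of context precision and context recall, F1 = 2PR / (P + R). The cell below is a small standalone sketch with made-up scores, not Ragas output. `harmonic_f1` is a hypothetical helper that treats the undefined 0/0 case as 0.0. This matters because pandas `mean()` skips NaN values by default, so leaving such rows as NaN would drop them from the average."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def harmonic_f1(precision, recall):\n",
    "    \"\"\"Element-wise F1 = 2PR / (P + R), defined as 0.0 where P + R == 0.\"\"\"\n",
    "    precision = pd.Series(precision, dtype=float)\n",
    "    recall = pd.Series(recall, dtype=float)\n",
    "    denom = precision + recall\n",
    "    # Mask zero denominators so the division yields NaN, then score them as 0.\n",
    "    f1 = 2.0 * precision * recall / denom.where(denom > 0, np.nan)\n",
    "    return f1.fillna(0.0)\n",
    "\n",
    "# Made-up scores, for illustration only.\n",
    "scores = pd.DataFrame({'context_precision': [1.0, 0.5, 0.0],\n",
    "                       'context_recall': [1.0, 1.0, 0.0]})\n",
    "scores['context_f1'] = harmonic_f1(scores.context_precision, scores.context_recall)\n",
    "print(scores)\n",
    "print('mean F1 =', np.round(scores.context_f1.mean(), 2))"
   ]
  },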
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display Retrieval average score.\n",
    "print(f\"Using {eval_df.shape[0]} eval questions, Mean Retrieval F1 Score = {avg_retrieval_f1}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Props to Sebastian Raschka for this handy watermark.\n",
    "# !pip install watermark\n",
    "\n",
    "%load_ext watermark\n",
    "%watermark -a 'Christy Bergman' -v -p datasets,langchain,openai,ragas --conda"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
